A Fault-tolerance Linguistic Structure for Distributed Applications
Abstract
The central topic of this dissertation is the set of structures used to express fault-tolerance provisions in application software. Structuring techniques provide a means to control complexity, which is a relevant factor in the introduction of design faults. This fact, together with the ever-increasing complexity of today’s distributed software, justifies the need for simple, coherent, and effective structures for expressing fault tolerance in application software. A first contribution of this dissertation is the definition of a base of structural attributes with which application-level fault-tolerance structures can be qualitatively assessed and compared, both with each other and against the above-mentioned need. This result is then used to provide a detailed survey of the state of the art in software fault-tolerance structures. The key contribution of this work is a novel structuring technique for expressing fault-tolerance design concerns in the application layer of distributed software systems that have soft real-time requirements and a number of processing nodes known at compile time. The main thesis of this dissertation is that this new structuring technique can exhibit satisfactory values of the structural attributes in the domain of soft real-time, distributed, and parallel applications. In this novel approach, besides the conventional programming language that addresses the functional design concerns, a special-purpose linguistic structure (the so-called “recovery language”) is available to address error recovery and reconfiguration. The recovery language comes into play as soon as an error is detected by an underlying error detection layer, or when an erroneous condition is signalled by the application processes. Error recovery and reconfiguration are specified as a set of guarded actions, i.e., actions that require a pre-condition to be fulfilled in order to be executed.
Recovery actions deal with coarse-grained entities of the application, and pre-conditions query the current state of those entities. An important added value of this so-called “recovery language approach” is that the executable code is structured so that the portion addressing fault tolerance is distinct and separated from the rest of the code. This allows complexity to be divided into distinct blocks that can be tackled independently of each other. This dissertation also describes a prototype of a compliant architecture that has been developed in the framework of two ESPRIT projects. The approach is illustrated via a few case studies. Some preliminary steps towards an overall analysis and assessment of the novel approach are contributed by means of reliability models, discrete mathematics, and simulations. Finally, the dissertation describes how the recovery language approach may serve as a harness with which to trade off optimally the complexity of the failure mode against the number and type of faults being tolerated. This would provide dynamic adaptation of the application to variations in the fault model of the environment.
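The guarded-action idea described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's recovery language or its syntax: the names `GuardedAction`, `run_recovery`, `restart`, the entity names, and the status strings are all hypothetical, chosen only to show pre-conditions querying the state of coarse-grained entities and actions firing when a guard holds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical state table: coarse-grained entity name -> status string.
State = Dict[str, str]


@dataclass
class GuardedAction:
    """A recovery action that runs only if its pre-condition (guard) holds."""
    name: str
    guard: Callable[[State], bool]    # queries the current state of entities
    action: Callable[[State], None]   # recovery/reconfiguration step


def run_recovery(state: State, rules: List[GuardedAction]) -> List[str]:
    """Evaluate each guard in order; execute the action when the guard is true.

    Returns the names of the actions that fired.
    """
    fired = []
    for rule in rules:
        if rule.guard(state):
            rule.action(state)
            fired.append(rule.name)
    return fired


def restart(entity: str) -> Callable[[State], None]:
    """Build an action that marks the given entity as running again."""
    def _do(state: State) -> None:
        state[entity] = "running"
    return _do


# The recovery rules live apart from the functional code, which only
# needs to report errors into the state table.
rules = [
    GuardedAction("restart-task1", lambda s: s["task1"] == "faulty", restart("task1")),
    GuardedAction("restart-node2", lambda s: s["node2"] == "faulty", restart("node2")),
]

state = {"task1": "faulty", "node2": "running"}
fired = run_recovery(state, rules)
print(fired, state)
```

Keeping `rules` in a separate structure mirrors the separation the abstract emphasises: the fault-tolerance portion can be changed without touching the functional code.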
List of abbreviations

Abbreviation: Meaning (Section)
A: Adaptability (1.1.2)
ALFT: Application-Level Fault-Tolerance (1.1)
AMS: Algorithm of Mutual Suspicion (5.2.4)
AOP: Aspect-Oriented Programming (3.4)
APL: ENEL application component (6.3.2)
AS: Alarm Scheduler (B.1)
AT: Alarm Thread (B.1)
AW: Algorithmic Worker Processes (6.1.3)
BB: TIRAN Backbone (4.2.1)
BSL: TIRAN Basic Services Library (4.2.1)
BSW: ENEL “basic software” component (6.3.2)
BT: TIRAN Basic Tool (4.2.1)
COTS: Commercial-Off-The-Shelf (4.1.1)
CSP: Communicating Sequential Processes (3.5)
DB: TIRAN Database (4.2.1)
DIR net: EFTOS Detection, Isolation and Recovery network (3.1.1)
DM: TIRAN Dependable Mechanism (5.1)
DV: TIRAN Distributed Voting Tool (5.2.3)
EE: N-Version Executive (3.1.2.2)
EFTOS: Embedded Fault-Tolerant Supercomputing, ESPRIT project 21012 (3.1.1)
EM: ENEL Exchange Memory component (6.3.2)
EMI: Electro-Magnetic Interference (1.1.2)
FSM: Fail-Stop Modules (3.3.1.4)
FTAG: Fault-Tolerant Attribute Grammars (3.3.2.3)
GLIMP: Siemens Gray Level IMage Processing package (6.1.1)
ID: Image Dispatcher (6.1.3)
IMP: Siemens Integrated Mail Processor (6.1)
LCL: Local Control Level (6.3.1)
MC: Machine Control (6.1.1)
MOP: Metaobject Protocol (3.2)
MV: Multiple-Version software fault-tolerance (3.1.2)
NMR: N-Modular Redundancy (5.2.3)
NVP: N-Version Programming (3.1.2.2)
OCR: Optical Character Recognition module (6.1.1)
OS: Operating System (4.1.1)
PC: Personal Computer (5.1)
PS: Primary substation (6.3.1)
PSAS: Primary substation automation system (6.3.1)
RB: Recovery Blocks (3.1.2.1)
RD: Region Descriptor (6.1.3)
REL: REcovery Language approach (4)
RINT: Recovery Interpreter (5.3.3.2)
RM: Result Manager (6.1.3)
RMP: Recovery Meta-Program (3.5)
ROI: Region Of Interest (6.1.1)
RW: Redundant Watchdog (6.4.1)
SA: Syntactical Adequacy (1.1.2)
SC: Separation of the design Concerns (1.1.2)
SV: Single-Version software fault-tolerance (3.1.1)
TIRAN: TaIlorable fault-toleRANce frameworks for embedded applications, ESPRIT project 28620 (1.1.3)
TMR: Triple Modular Redundancy (5.3.3.1)
TOLM: Time-Out List Manager (B.3)
TOM: TIRAN Time-Out Manager (4.2.1)
VITA: EFTOS vertical integration test application (6.1.5)
WWW: World-Wide Web (3.1.1)

List of the symbols used and of their definitions

Symbol: Definition (Section)
$A(t)$: Availability (probability that a service is operating correctly and is available to perform its functions at time $t$) (2.1.2.3)
$A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR})$: Steady-state availability (2.1.2.3)
[symbol illegible]: Failure behaviours of failure class [illegible] (2.1.3.1)
[symbol illegible]: Error recovery coverage (7.1.1)
correct[arguments illegible]: Event “the system provides its service during time interval [illegible]” (2.1.2.1)
[symbol illegible]: 1 if [illegible] is true, 0 otherwise (7.2.1.1)
[symbol illegible]: Efficiency, i.e., percentage of slots used during a run (7.2.1.1)
fail [formula illegible]: Failure density function (2.1.2.1)
[symbol illegible]: Set [definition illegible] (4.1.1)
$\lambda$: Failure rate (2.1.2.1)
$\lambda$ [subscript illegible]: Length of a run with [illegible] processors (7.2.1.1)
$M(t) = 1 - e^{-\mu t}$ (if $\mu$ is constant): Maintainability (probability that a failed system will be repaired in a time less than or equal to $t$) (2.1.2.4)
$\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$: Mean Time Between Failure (2.1.2.2)
$\mathrm{MTTF} = \int_0^\infty R(t)\,dt$: Mean Time To Failure (2.1.2.2)
MTTR: Mean Time To Repair, i.e., average time required to repair a system (2.1.2.2)
$\mu$: Repair rate (2.1.2.2)
$\mu$ [subscript illegible]: Average slot utilisation (7.2.1.1)
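As a worked illustration of the dependability quantities in the symbol list, the following Python sketch computes MTBF, steady-state availability, and maintainability. MTBF = MTTF + MTTR comes from the list itself. The ratio MTTF / (MTTF + MTTR) and the exponential form 1 - exp(-μt) are the standard textbook definitions and are assumed here, since the extracted list is partly illegible. The function names and the numeric values are made up for the example.

```python
import math


def mtbf(mttf: float, mttr: float) -> float:
    """Mean Time Between Failure: MTBF = MTTF + MTTR."""
    return mttf + mttr


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Assumed standard definition: A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def maintainability(t: float, mu: float) -> float:
    """Assumed exponential repair model with constant repair rate mu:
    probability that a failed system is repaired within time t."""
    return 1.0 - math.exp(-mu * t)


# Hypothetical figures, in hours.
mttf_h = 990.0
mttr_h = 10.0
mu = 1.0 / mttr_h  # under the exponential model, repair rate = 1 / MTTR

print(mtbf(mttf_h, mttr_h))                       # 1000.0
print(steady_state_availability(mttf_h, mttr_h))  # 0.99
print(maintainability(mttr_h, mu))                # 1 - 1/e, about 0.632
```

With these figures the system is available 99% of the time, and about 63% of repairs complete within one mean repair time, which is a property of the exponential model rather than of any particular system.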
Similar resources
$\mathcal R\!\raise2pt\hbox{$\varepsilon$}\!\hbox{$\mathcal L$}$: A Fault Tolerance Linguistic Structure for Distributed Applications
The embedding of fault tolerance provisions into the application layer of a programming language is a nontrivial task that has not found a satisfactory solution yet. Such a solution is very important, and the lack of a simple, coherent and effective structuring technique for fault tolerance has been termed by researchers in this field as the “software bottleneck of system development”. The aim ...
Improving the palbimm scheduling algorithm for fault tolerance in cloud computing
Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...
REL: A Fault Tolerance Linguistic Structure for Distributed Applications
The structures for the expression of fault-tolerance provisions into the application software are the central topic of this dissertation. Structuring techniques provide means to control complexity, the latter being a relevant factor for the introduction of design faults. This fact and the ever increasing complexity of today’s distributed software justify the need for simple, coherent, and effec...
Presenting a replicated agent-based approach for executing a reliable mobile code pattern
Using mobile agents, it is possible to bring the code close to the resources, which is not foreseen by the traditional client/server paradigm. Compared to the client/server computing paradigm, the greater flexibility of the mobile agent paradigm comes at additional costs as well as the additional complexity of developing and managing mobile agent-based applications. Such complexity ...
Using Reflection for Incorporating Fault-Tolerance Techniques into Distributed Applications
As part of the Legion metacomputing project, we have developed a reflective model, the Reflective Graph & Event (RGE) model, for incorporating functionality into applications. In this paper we apply the RGE model to the problem of making applications more robust to failures. RGE encourages system developers to express fault-tolerance algorithms in terms of transformations on the data structures...
Drago: An Ada Extension to Program Fault-Tolerant Distributed Applications
This paper describes Drago, an experimental language designed to support the implementation of fault-tolerant distributed applications. The language is the result of an effort to impose discipline and give linguistic support to the main concepts of Isis, as well as to experiment with the group communication paradigm. Drago has been designed and implemented as an extension to Ada 83. In this pape...